An introduction to Geographic And Taxonomic Occurrence R-based Scrubbing (gatoRs): An R package and workflow for processing biodiversity data

Michelle L. Gaynor, Natalie N. Patten, Douglas E. Soltis, and Pamela S. Soltis

How many digitized records exist for the Southern Appalachian endemic, Galax urceolata (Diapensiaceae)?

  1. Introduce Galax urceolata.
  2. Demonstrate the many ways to download occurrence records for Galax urceolata.
  3. Introduce gatoRs.

Galax urceolata (Diapensiaceae)

  • In the family Diapensiaceae, along with:
    • Berneuxia, Diapensia, Pyxidanthera, Shortia, and Schizocodon.

  • Southern Appalachian endemic.

Taxonomic Issues 🏴‍☠️🔥🐴

  • Identified in the 1730s by John Clayton and named Galax aphylla.

    • Sent to Jan Frederik Gronovius, who published it as “Anonymos or Belvedere”.

      • All of these specimens were destroyed 🔥.
  • John Mitchell collected Nemophila aphylla (Boraginaceae), but mistakenly labeled the specimen as Galax aphylla.

    • Specimens were stolen by pirates 🏴‍☠️ on the way to Linnaeus.

      • Linnaeus accepted John Mitchell's description 🌪️.

The current type, from André Michaux 🐴, is labeled Galax aphylla, but that name now belongs to Nemophila aphylla.

  • On GBIF, there is only one specimen labeled Galax aphylla L. that does not belong to Galax urceolata (Poir.) Brummitt.

    • Currently, all of these specimens are assigned to Nemophila aphylla.

Downloading Occurrence Records

rgbif

# Load package
library(rgbif)

Option 1: scientificName

gbif_data <- occ_data(scientificName = c("Galax urceolata", "Galax aphylla"), 
                      limit = 8000)
nrow(gbif_data$`Galax urceolata`$data) + nrow(gbif_data$`Galax aphylla`$data)
## [1] 7097
## Now look at the verbatim records
### Make sure the keys are numeric
gbif_data$`Galax urceolata`$data$key <- as.numeric(gbif_data$`Galax urceolata`$data$key)
gbif_data$`Galax aphylla`$data$key <- as.numeric(gbif_data$`Galax aphylla`$data$key)

## Download the verbatim scientificNames
query_gbif <- occ_get_verbatim(key = c(gbif_data$`Galax urceolata`$data$key,
                                       gbif_data$`Galax aphylla`$data$key) , 
                               fields = c("scientificName"))
unique(query_gbif$scientificName)
## [1] "Galax urceolata"                   "Galax urceolata (Poir.) Brummitt" 
## [3] "Galax urceolata (Poiret) Brummitt" "Galax aphylla"                    
## [5] "Galax aphylla L."                  "Galax aphylla hort. non L."       
## [7] "Galax aphylla auct. non L."

Option 2: species key

specieskey <- name_backbone(name = "Galax urceolata")
gbif_data2 <- occ_data(taxonKey = specieskey$speciesKey, 
                       limit = 8000)

ridigbio

# Load package
library(ridigbio)
iDigBio_data <- rbind(idig_search_records(rq=list(scientificname="Galax urceolata")), 
                      idig_search_records(rq=list(scientificname="Galax aphylla")))
nrow(iDigBio_data)
## [1] 1692
unique(iDigBio_data$scientificname)
## [1] "galax urceolata" "galax aphylla"

spocc

# Load package
library(spocc)
spocc_data <- spocc::occ2df(spocc::occ(query = c("Galax urceolata", "Galax aphylla"),
                                       from = c("gbif", "idigbio"), limit = 10000))
nrow(spocc_data)
## [1] 8789
unique(spocc_data$name)
## [1] "Galax urceolata (Poir.) Brummitt"        
## [2] "Solenandria cordifolia P.Beauv. ex Vent."
## [3] "Galax aphylla L."                        
## [4] "galax urceolata"                         
## [5] "galax aphylla"

gatoRs

# Load package
library(gatoRs)
gatoRs_data <- gators_download(synonyms.list = c("Galax urceolata", "Galax aphylla"))
nrow(gatoRs_data)
## [1] 9118
unique(gatoRs_data$scientificName)
## [1] "Galax urceolata (Poir.) Brummitt"        
## [2] "Galax aphylla L."                        
## [3] "Solenandria cordifolia P.Beauv. ex Vent."
## [4] "Galax urceolata"                         
## [5] "Galax urceolata (Poiret) Brummitt"       
## [6] "Galax urceolaa"                          
## [7] "Galax aphylla"
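
The record counts reported by the four approaches above can be compared with a little arithmetic. The sketch below only re-uses the numbers printed in the outputs above; the object names (counts, gbif_plus_idigbio, extra_in_gatoRs) are hypothetical, and the counts themselves reflect the downloads shown here and will change as the aggregators are updated.

```r
# Record counts copied from the outputs above
counts <- c(rgbif = 7097, ridigbio = 1692, spocc = 8789, gatoRs = 9118)

# spocc queried both GBIF and iDigBio, so compare it with the sum
# of the two single-source downloads
gbif_plus_idigbio <- counts[["rgbif"]] + counts[["ridigbio"]]
gbif_plus_idigbio == counts[["spocc"]]

# Additional records returned by gatoRs relative to spocc
extra_in_gatoRs <- counts[["gatoRs"]] - counts[["spocc"]]
extra_in_gatoRs
```

In this download, the spocc total equals the rgbif and ridigbio totals combined, while gatoRs returned 329 more records than spocc.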

Take-away

Cleaning Occurrence Records

  • Taxonomic harmonization

    • taxa_clean()
  • Locality cleaning

    • basic_locality_clean()

    • process_flagged()

  • Remove duplicate records

    • remove_duplicates()
  • Basis cleaning

    • basis_clean()
  • Spatial correction

    • thin_points()

    • one_point_per_pixel()

  • Downstream data processing

    • citation_bellow()

    • remove_redacted()

    • data_chomp()

Taxonomic harmonization

unique(gatoRs_data$scientificName)
## [1] "Galax urceolata (Poir.) Brummitt"        
## [2] "Galax aphylla L."                        
## [3] "Solenandria cordifolia P.Beauv. ex Vent."
## [4] "Galax urceolata"                         
## [5] "Galax urceolata (Poiret) Brummitt"       
## [6] "Galax urceolaa"                          
## [7] "Galax aphylla"
ex <- taxa_clean(df = gatoRs_data,  
                 synonyms.list = c("Galax urceolata", "Galax aphylla"), 
                 taxa.filter = "fuzzy", 
                 accepted.name = "Galax urceolata") 
## Current scientific names:
## [1] "Galax urceolata (Poir.) Brummitt"        
## [2] "Galax aphylla L."                        
## [3] "Solenandria cordifolia P.Beauv. ex Vent."
## [4] "Galax urceolata"                         
## [5] "Galax urceolata (Poiret) Brummitt"       
## [6] "Galax urceolaa"                          
## [7] "Galax aphylla"
## User selected a(n) fuzzy match.
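
To make the difference between an exact and a fuzzy filter concrete, the following base R sketch applies both to the seven names listed above. It is an illustration only: agrepl() with max.distance = 0.1 is an assumed stand-in for approximate matching and is not necessarily how taxa_clean() is implemented, and the object names (names_found, synonyms, exact_hit, fuzzy_hit) are hypothetical.

```r
# The seven scientific names returned by the download above
names_found <- c("Galax urceolata (Poir.) Brummitt",
                 "Galax aphylla L.",
                 "Solenandria cordifolia P.Beauv. ex Vent.",
                 "Galax urceolata",
                 "Galax urceolata (Poiret) Brummitt",
                 "Galax urceolaa",
                 "Galax aphylla")
synonyms <- c("Galax urceolata", "Galax aphylla")

# Exact filter: keep a name only if it is identical to a synonym
exact_hit <- names_found %in% synonyms

# Approximate filter (assumption: base R agrepl as a stand-in):
# keep a name if it approximately contains any synonym
fuzzy_hit <- Reduce(`|`, lapply(synonyms, function(s) {
  agrepl(s, names_found, max.distance = 0.1)
}))

data.frame(name = names_found, exact = exact_hit, fuzzy = fuzzy_hit)
```

In this sketch the exact filter keeps only the two bare names, whereas the approximate filter also keeps the names with authorities and the misspelled "Galax urceolaa", and drops only "Solenandria cordifolia P.Beauv. ex Vent.".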

Locality filtering

Basic Locality Clean

gatoRs_data <- basic_locality_clean(df = gatoRs_data,  
                                    remove.zero = TRUE, # Records at (0,0) are removed
                                    precision = TRUE, # lat and long are rounded 
                                    digits = 2, # round to 2 decimal places
                                    remove.skewed = TRUE) # Records flagged as having skewed coordinates are removed
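
The effect of the remove.zero, precision, and digits arguments can be illustrated on a toy data frame in base R. This is a minimal sketch of the two steps described in the comments above (dropping records at (0,0) and rounding to two decimal places), not the code inside basic_locality_clean(); the toy coordinates and object names are hypothetical, and the remove.skewed step is not illustrated.

```r
# Hypothetical toy records; the second one sits at (0,0)
toy <- data.frame(latitude  = c(35.123456, 0, 36.987654),
                  longitude = c(-82.654321, 0, -81.123456))

# Step 1: drop records where both coordinates are exactly zero
keep <- !(toy$latitude == 0 & toy$longitude == 0)
cleaned <- toy[keep, ]

# Step 2: round latitude and longitude to 2 decimal places
cleaned$latitude  <- round(cleaned$latitude, 2)
cleaned$longitude <- round(cleaned$longitude, 2)

cleaned
```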

Quick Map

# Load package
library(ggplot2)

# State outlines; borders() relies on the maps package being installed
mapUSA <- borders("state", colour="black", fill="white")

ggplot() +
  mapUSA +
  geom_point(data = gatoRs_data, mapping = aes(x = longitude, 
                                               y = latitude, 
                                               col = factor(aggregator))) +
  coord_sf(xlim = c(-90, -68), ylim = c(25, 50)) +
  ylab("Latitude") +
  xlab("Longitude") +
  labs(col = "Aggregator")